NMBS/SNCB trains, which is the biggest railway serving the country of Belgium, needs to postpone the expansion because of lack of train personnel staff. Also, the raise on energy cost increase the operating cost. Meanwhile, the company still face revenue loss because of low passengers on some routes.
With these problems, the company can strategically plan their policies on optimizing resource allocation, enhancing operational cost efficiency, and also increasing revenue especially on their low occupancy routes. But how can they do that?
This project is aim to predict occupancy levels (categorized by high and low) for different Origin-Destination (OD) pairs of Belgium railway system to allocate resources such as train frequency and size effectively to help transportation planners of NMBS/SNCB or the Transportation Department in Belgium.
e.g. extend/merge train routes, minimize vacant trains or mitigate crowdedness of trains etc.
Question:
What is the level of occupancy of a certain OD pair at a specific time?
What options do planners have to manage this demand?
e.g. decrease amount of trains, merge of two similar routes, or decrease the frequent of the route to change low routes to medium, and of increase amount of trains, separate two similar routes, or increase the frequent of the route to change high routes to medium.
We build this web-based application, called Re-Train that predict the occupancy rate across the routes with three stages, first analysing and visualizing past data, creating long-term or short-term predictions based on the users’ need, and last, showcasing potential benefits and risk on the predictions.
This is the interface which shows the latest occupancy rate where orange shows low occupancy routes. First feature is History trend analysis. On the top-right corner, user can filter based on those variables and click analysis, to see graph and maps of the train occupancy level in the past.
Second feature is prediction. you can predict by filtering like the first one, and choose the period time based on your needs. After clicking the predict, the predicted occupancy rate will result in the percentage of each level and be shown in the map, and also the accuracy percentage & graph below the map.
Last feature, you can check on the cost benefit analysis based on the scenario of the prediction which will show you the potential benefit & potential risk of optimizing routes on those period of time. Then, you can export the report to help you make decision based on that.
How it works?
We use Kaggle train occupancy level data on 2016 which have these variables that potentially correlate with the occupancy rate. And add some external datas like weather, demographic, and fare, and then predict using logistic regression.
From this, we can see high-demand OD pairs are consistent, while low-demand OD pairs keep on changing. High occupancy is concentrated on major intercity and commuter routes. Meanwhile, smaller or regional routes shows reduced demand. But there are two quite big OD Pairs that dominate low-occupancy, possibly indicating a mismatch in train frequency with actual demand.
The modeling approach uses binomial regression to predict train occupancy (high or low) while identifying key factors that are validated through k-fold cross-validation for robustness. Also, a confusion matrix will assess the model’s predictive accuracy and reliability which is shown in the app through accuracy percentage and graph.
To minimize the risks of misclassifying high-occupancy routes as low, we set a higher threshold to ensure only routes with a strong likelihood of low occupancy are chosen just like shown on the animation example, It will prioritize service reliability and passenger satisfaction.
By correctly predicting low-occupancy routes, we can achieve significant cost savings in maintenance, labor, and energy while avoiding wasted resources. However, misclassifying high-occupancy as low can lead to revenue loss, overcrowding, and passenger dissatisfaction. Striking the right balance in predictions ensures optimized resource allocation and supports a sustainable and efficient rail network.
We loaded libraries that are required to analyze and visualize the data as shown in the code below.
knitr::opts_chunk$set(echo = TRUE)
#install.packages("hms")
#install.packages("BelgiumStatistics", repos = "http://www.datatailor.be/rcube", type = "source")
#install.packages("devtools")
#library(devtools)
#devtools::install_github("jwijffels/BelgiumMaps.Admin", build_vignettes = TRUE)
#devtools::install_github("jwijffels/StatisticsBelgium", build_vignettes = TRUE)
#library(BelgiumStatistics)
library(tidyverse)
library(tidycensus)
library(sf)
library(knitr)
library(kableExtra)
library(caret)
library(pscl)
library(mapview)
library(dplyr)
library(scales)
library(viridis)
library(spdep)
library(caret)
library(plotROC)
library(pROC)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot)
library(jtools) # for regression model plots
library(broom)
library(tufte)
library(rmarkdown)
library(pander)
library(classInt)
library(ggplot2)
library(units)
library(leaflet)
library(lubridate)
library(hms)
library(riem) # for weather
library(readxl) # for read xlsx
library(ggtext)
library(showtext) # to set up font in plots
font_add_google("Roboto", "roboto") # to set up font in plots
showtext_auto(TRUE)
library(geosphere) # to calculate distance from two different geolocation (lat, lng)
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
colorPallete <- c("low" = "#df543b", "medium" = "#a1b7b8", "high" = "#a1b7b8")
plotTheme <- theme(
plot.title = element_text(face="bold", hjust = 0, size=15, lineheight=0.8),
plot.subtitle = element_text(hjust = 0, size = 10, face = "italic", lineheight=0.8, margin = margin(b = 3, t = 6)),
plot.caption = element_text(size = 10, hjust = 0, lineheight=0.9),
plot.margin = margin(1.7, 1.7, 1.7, 1.7),
text = element_text(family = "roboto"),
legend.title = element_text(size = 10))
We loaded three default data (1)-(3) of NMBS trains, stations, and lines from Kaggle, and four additional data (4)-(7) of Belgium statistical tracts, Belgium population data, zone station data, and weather data from RIEM package. [1]
trains, trains_test,
trains_train, trains_trainTest,
trains_trarinTrain)The training dataset of trains provides date, time, origin station
code (from), destination station code (to),
vehicle ID, and occupancy information shown at three different levels
(low, medium, high), from July 27, 2016 to October 29, 2016. The test
dataset of trains have the same columns as the training dataset, but
from October 29, 2016 to December 19, 2016. The test dataset does not
have occupancy level information.
stations)The stations dataset contains names, locations (longitude and
latitude), and average stop times (avg_stop_times). Since
NMBS operates trains in Belgium as well as France and the Netherlands,
we filtered the dataset to exclude non-Belgian stations (e.g. those in
the Netherlands, France, Luxembourg etc.). Additionally, we created
columns to calculate the distance between origin and destination
stations, the distance from Belgium to the origin and destination
stations, and the bearing between origin and destination stations.
line_info)The lines dataset includes information about the vehicle ID, vehicle type, the number of stops, and stopping station IDs. Since the information in the train dataset is insufficient, we did not use stopping station IDs in this analysis.
statBel and
statBel_muni)The Belgium statistical tracts dataset contains the geometries of each tract. We gained this dataset from Statbel, the Belgian statistical office. We joined this dataset to calculate the population of Belgium by municipality and intersect this dataset to the stations and the weather stations.
popBel)We assumed that the population of the origin and destination regions would affect the occupancy level of the trains. We used the Belgium population dataset to calculate the population of Belgium by municipality and joined this dataset to the statistical tracts dataset.
stationZone)The zone station dataset contains the zone information of each train station. At first, we assumed the fare information may be one of the independent variable so we added this information to the stations dataset to calculate the fare. However, we could not find the opened fare information of NMBS, so we did not use this information in the analysis.
weather_belgium,
weatherstations_with_muni)We assumed that the weather conditions of the origin and destination regions at specific dates and times would affect the occupancy level of the trains. The dataset is from RIEM, which is developed by Iowa Environment Mesonet. Since RIEM provides global coverage, we were able to get temperature, wind, and relative humidity data from 12 weather stations located in Belgium airports. Since the percipitation data of Belgium was unavailable in RIEM, we used relative humidity as a substitute variable to reflect general patterns related to precipitation.
# Load (1) trains, (2) stations, and (3) SNCB lines data
line_info <- read.csv("./Data/line_info.csv")
trains_test <- read.csv("./Data/trains_test.csv")
trains_train <- read.csv("./Data/trains_train.csv")
stations <- read.csv("./Data/stations.csv")
# Load (4) Belgium statistical tracts and (5) population data
statBel <- st_read("./Data/sh_statbel_statistical_sectors_3812_20240101.geojson") %>%
st_transform(4326)
popBel <- read_excel("./Data/TF_POP_STRUCT_SECTORS_2016.xlsx") # 2016 Belgium population data by tracts
# Load Zone station data to calculate fare
stationZone <- read_excel("./Data/station-zone.xlsx")
# Filter stations data to exclude non-Belgian stations (Netherlands, France, etc.)
stations <- stations %>%
filter(country.code=="be")
stations <- left_join(stations, stationZone, by = c("name" = "Station")) # Add zone information to stations)
# Stat tracts and population of Belgium by municipality
statBel_muni <- statBel %>%
group_by(cd_dstr_refnis) %>%
summarize(geometry = st_union(geometry)) %>%
ungroup()
statBel_muni <- statBel_muni %>%
left_join(
statBel %>% st_drop_geometry() %>% select(cd_dstr_refnis, tx_adm_dstr_descr_fr) %>% distinct(),
by = "cd_dstr_refnis"
)
popBel_muni <- popBel %>%
mutate(CD_REFNIS_N = as.character(as.numeric(substr(CD_REFNIS, 1, 2)) * 1000)) %>%
select(CD_REFNIS_N, POPULATION, TX_DESCR_FR) %>%
group_by(CD_REFNIS_N) %>%
summarize(POPULATION = sum(POPULATION))
statBel_muni <- left_join(statBel_muni, popBel_muni, by = c("cd_dstr_refnis" = "CD_REFNIS_N"))
# Clean up the URI column to extract the station IDs
stations <- stations %>%
mutate(station_id = gsub("http://irail.be/stations/NMBS/", "", URI))
# Convert stations dataset to sf object
stations_sf <- stations %>%
st_as_sf(coords = c("longitude", "latitude"), crs = 4326)
We transformed raw data into a more usable format for analysis through the following processs. First, we combined the training and test datasets, which were initially separated, to manipulate the entire dataset at once. After we finished the data wrangling, we split them back into training and test sets.
Second, we recoded occupancy data from three levels (low, medium, high) to two levels (low, high). This was done to focus more on analyzing trends in low occupancy data, and to simplify regression analysis process by using a binomial model instead of a multinomial one.
Low: There are plenty of seats left.
Medium: It is hard to find a seat and it is difficult to sit together.
High: There are no seats left and people have to stand up
Low: There are plenty of seats left.
High: It is hard to find a seat and it is difficult to sit together or there are no seats left and people have to stand up.
Lastly, we created a panel dataset called trains by
performing a left_join of internal and external datasets. This
incorporated both dependent and independent variables of the following
into one table. The detailed is as shown in Table 1.
Primary internal data: datetime (date and time,
dbl), week (week number, chr), dotw (day of
week), interval60 (hourly interval), from_x
(origin station information, e.g. name, latitude, longitude etc.),
to_x (destination station information, e.g. name, latitude,
longitude etc.), vehicle (train number),
occupancy_original (occupancy of three levels),
occupancy (occupancy of two levels),
avg_stop_times (average stop times of each line),
n_stops (number of stops of each line)
Primary external data:
Administration sector: cd_dstr_refnis (borough
number), tx_adm_dstr_descr_fr (name of borough that origin
and destination stations are located)
Population: POPULATION (population of borough that
origin and destination stations are located)
Weather: temp, humid, wind
(temperature, humidity, and wind speed of each hour of day at origin and
destination stations), nearest_weather_station (the nearest
weather station location of origin and destination stations)
Event: event (weekend and holiday information)
Event: 0 = Weekday, 1 = Weekend or holiday
| Variable | Type | Description |
|---|---|---|
from/to
|
<chr> | 9-digit ID of each origin and destination station |
vehicle
|
<chr> | ID of train |
occupancy_original
|
<chr> | Train occupancy of three levels (low, medium, high) |
occupancy
|
<chr> | Train occupancy of two levels (low, high) |
interval60
|
<dttm> | Hourly interval |
week
|
<dbl> | Week number |
datetime
|
<dttm> | Date and time |
dotw
|
<ord> | Day of week |
from/to_station
|
<chr> | Name of each origin and destination station |
from/to_avg_stop_times
|
<dbl> | The average number of vehicles stopping each day in this station |
from/to_lng
|
<dbl> | Longitude of each origin and destination station |
from/to_lat
|
<dbl> | Latitude of each origin and destination station |
from/to_pop
|
<dbl> | Population of each origin and destination station |
from/to_cd_dstr_refnis
|
<chr> | 5-digit borough number that each origin and destination station is located |
from/to_tx_adm_dstr_descr_fr
|
<chr> | Name of a borough that each origin and destination station is located |
from/to_nearest_weatherstation
|
<chr> | 4-letter code of the nearest weather station from each origin and destination train station |
from/to_temp
|
<dbl> | Temperature of each origin and destination train station in Celsius |
from/to_humid
|
<dbl> | Relative humidity of each origin and destination train station in percentage |
from/to_wind
|
<dbl> | Wind of each origin and destination train station in meters per second |
vehicle_type
|
<chr> | 4-letter code of vehicle type |
nr_of_stops
|
<int> | Number of stops of this O-D pair |
event
|
<dbl> | Weekend and holiday information (0 = Weekday, 1 = Weekend or holiday) |
n_high
|
<int> | Frequency of high occupancy of this O-D pair during Jul-Oct 2016 |
n_low
|
<int> | Frequency of low occupancy of this O-D pair during Jul-Oct 2016 |
dist_from_O_to_Brussels
|
<dbl> | Distance from an origin station to Brussels in kilometers |
dist_from_D_to_Brussels
|
<dbl> | Distance from a destination station to Brussels in kilometers |
dist_from_O_to_D
|
<dbl> | Distance between an origin and a destination station in kilometers |
bearing_from_O_to_D
|
<dbl> | Bearing angle from an origin to a destination station |
bearing_from_O_to_D_cat
|
<chr> | Categorized bearing angle (N, S, E, W, NE, NW, SE, SW) |
time_cat
|
<fct> | Categorized time by commute, morning, noon, evening, midnight |
date_numeric
|
<dbl> | Numerized date |
time_numeric
|
<dbl> | Numerized time |
occupancy_numeric
|
<dbl> | Numerized occupancy (0 = high, 1 = low) |
# Change station code from numeric to character and rbind train and test sets
trains_test$from <- as.character(trains_test$from)
trains_test$from <- paste0("00", trains_test$from)
trains_test$to <- as.character(trains_test$to)
trains_test$to <- paste0("00", trains_test$to)
trains_test$occupancy <- NA
# Rbind train and test sets
trains <- rbind(trains_train, trains_test)
#trains$occupancy <- factor(trains$occupancy, levels = c("low", "medium", "high"))
# Believe this causes regression error
# Convert date and time to POSIX datetime
trains$datetime <- as.POSIXct(paste(trains$date, trains$time), format = "%Y-%m-%d %I:%M:%S %p")
trains$week <- week(trains$datetime)
trains$dotw <- wday(trains$datetime, label = TRUE)
# Update vehicle ID
trains <- trains %>%
mutate(vehicle = case_when(
grepl("^\\d+$", vehicle) & vehicle %in% c("7006", "7966", "8011", "8015") ~ paste0("P", vehicle),
grepl("^\\d+$", vehicle) ~ paste0("IC", vehicle),
TRUE ~ vehicle
))
# Recode Low-medium-high to Low-high
trains$occupancy_original <- trains$occupancy
trains <- trains %>%
mutate(occupancy = recode(occupancy, "medium" = "high")) %>% # (1) Recode medium to high
# mutate(occupancy = ifelse(occupancy == "medium", NA_character_, occupancy)) %>%
# (2) Remove medium to high
mutate(interval60=ymd_h(substr(datetime, 1, 13))) %>%
mutate(week=week(interval60))
As of 2016, the population of Belgium is 11,267,910. In Belgium, 43
municipalities (CD_DSTR_REFNIS), 589 Reference from
National Institute of Statistics Code (CD_REFNIS), and
2,882 statistical sectors (CD_SECTOR). We performed a left
join to add the municipality where the origin and destination stations
are located and their population, to the trains
dataset.
statBel_muni <- statBel_muni %>%
mutate(tx_adm_dstr_descr_fr_appr = str_extract(tx_adm_dstr_descr_fr,
"(?<=Arrondissement d’|Arrondissement de )\\b[\\wÀ-ÿ\\s-]+\\b"))
ggplot()+
geom_sf(data = statBel_muni, aes(fill = POPULATION), color = "white")+
geom_sf_text(
data = statBel_muni,
aes(label = tx_adm_dstr_descr_fr_appr),
size = 3,
color = "white"
) +
labs(
title = "Population of Belgium by Municipality",
subtitle = "2016",
caption="Source: StatBel"
) +
theme_void()+plotTheme+
theme(legend.position = "bottom")
## Warning in st_point_on_surface.sfc(sf::st_zm(x)): st_point_on_surface may not
## give correct results for longitude/latitude data
stations_with_pop <- st_join(stations_sf, statBel_muni) %>%
st_drop_geometry()
stations_with_pop <- stations_with_pop %>%
bind_cols(as.data.frame(st_coordinates(stations_sf))) %>%
rename(longitude = X, latitude = Y)
# Add from and to station names and left join population into trains data
trains <- trains %>%
left_join(stations_with_pop, by = c("from" = "station_id"))
## Warning in left_join(., stations_with_pop, by = c(from = "station_id")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 87 of `x` matches multiple rows in `y`.
## ℹ Row 92 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
trains <- trains %>%
select(date, time, from, to, vehicle, occupancy_original, occupancy, interval60, week, datetime, dotw, name, avg_stop_times, longitude, latitude, cd_dstr_refnis, tx_adm_dstr_descr_fr, POPULATION) %>%
rename(from_station = name) %>%
rename(from_avg_stop_times = avg_stop_times) %>%
rename(from_lng = longitude) %>%
rename(from_lat = latitude) %>%
rename(from_pop = POPULATION) %>%
rename(from_cd_dstr_refnis = cd_dstr_refnis) %>%
rename(from_tx_adm_dstr_descr_fr = tx_adm_dstr_descr_fr)
trains <- trains %>%
left_join(stations_with_pop, by = c("to" = "station_id"))
## Warning in left_join(., stations_with_pop, by = c(to = "station_id")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 86 of `x` matches multiple rows in `y`.
## ℹ Row 92 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
trains <- trains %>%
select(date, time, from, to, vehicle, occupancy_original, occupancy, interval60, week, datetime, dotw, from_station, from_avg_stop_times, from_lng, from_lat, from_pop, from_cd_dstr_refnis, from_tx_adm_dstr_descr_fr, name, avg_stop_times, longitude, latitude, cd_dstr_refnis,tx_adm_dstr_descr_fr, POPULATION) %>%
rename(to_station = name) %>%
rename(to_avg_stop_times = avg_stop_times) %>%
rename(to_lng = longitude) %>%
rename(to_lat = latitude) %>%
rename(to_pop = POPULATION) %>%
rename(to_cd_dstr_refnis = cd_dstr_refnis)%>%
rename(to_tx_adm_dstr_descr_fr = tx_adm_dstr_descr_fr)
# Load weather data
weatherstations_sf <- riem_stations(network = "BE__ASOS") %>% # BE__ASOS = Belgium ASOS (Weather Stations)
st_as_sf(coords = c("longitude", "latitude"), crs = 4326)
weatherstations <- weatherstations_sf %>%
pull(id)
weather_belgium <- data.frame()
for (station in weatherstations) { # Loop through each station
weather_data <- riem_measures( # Fetch data for the current station
station = station,
date_start = "2016-07-27",
date_end = "2016-12-19"
) %>%
select(valid, tmpf, relh, sknt) %>% # Select relevant columns (timestamp, temperature by Fahrenheit, relative humidity in percentage, wind speed in knots)
mutate(station_id = station) # Add a column to identify the station
weather_belgium <- bind_rows(weather_belgium, weather_data) # Append the fetched data to the main data frame
}
weather_belgium <- weather_belgium %>%
mutate(interval60 = ymd_h(substr(valid, 1, 13))) %>% # Extract the timestamp and convert it to a POSIXct object
mutate(week = week(interval60),
dotw = wday(interval60, label = TRUE)) %>%
group_by(interval60, station_id) %>%
summarize(Temp = (max(tmpf, na.rm = TRUE)-32)*5/9, # Temperature by Celsius
Humid = max(relh, na.rm = TRUE), # relative humidity in percentage
Wind = max(sknt, na.rm = TRUE)*0.51444) # Wind speed in meters per second
## Warning: There were 337 warnings in `summarize()`.
## The first warning was:
## ℹ In argument: `Temp = (max(tmpf, na.rm = TRUE) - 32) * 5/9`.
## ℹ In group 6078: `interval60 = 2016-08-18 10:00:00` and `station_id = "EBBE"`.
## Caused by warning in `max()`:
## ! no non-missing arguments to max; returning -Inf
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 336 remaining warnings.
## `summarise()` has grouped output by 'interval60'. You can override using the
## `.groups` argument.
# Convert weather dataset to sf object
weatherstations_with_muni <- st_join(weatherstations_sf, statBel_muni) %>%
st_drop_geometry()
# Find nearest weather station for municipalities without a station
statBel_muni <- statBel_muni %>%
mutate(
weatherstation_within = st_intersects(geometry, weatherstations_sf, sparse = FALSE) %>%
apply(1, function(x) if (any(x)) weatherstations_sf$id[which(x)[1]] else NA) # Replace `id` with the column identifying stations
)
statBel_muni <- statBel_muni %>%
mutate(
nearest_weatherstation = ifelse(
is.na(weatherstation_within),
weatherstations_sf$id[st_nearest_feature(geometry, weatherstations_sf)], # Replace `id` with station identifier
weatherstation_within
)
)
statBel_muni <- statBel_muni %>% select(-weatherstation_within)
# left join weather station data into trains data
trains <- trains %>%
left_join(statBel_muni, by=c("from_cd_dstr_refnis" = "cd_dstr_refnis")) %>%
rename(from_nearest_weatherstation = nearest_weatherstation) %>%
select(-geometry, -POPULATION)
trains <- trains %>%
left_join(statBel_muni, by=c("to_cd_dstr_refnis" = "cd_dstr_refnis")) %>%
rename(to_nearest_weatherstation = nearest_weatherstation) %>%
select(-geometry, -POPULATION)
# left join weather data into trains data
trains <- trains %>%
left_join(weather_belgium, by = c("interval60" = "interval60", "from_nearest_weatherstation" = "station_id")) %>%
mutate(from_temp = Temp) %>%
mutate(from_humid = Humid) %>%
mutate(from_wind = Wind) %>%
select(-Temp, -Humid, -Wind)
trains <- trains %>%
left_join(weather_belgium, by = c("interval60" = "interval60", "to_nearest_weatherstation" = "station_id")) %>%
mutate(to_temp = Temp) %>%
mutate(to_humid = Humid) %>%
mutate(to_wind = Wind) %>%
select(-Temp, -Humid, -Wind)
trains <- trains %>%
mutate(from_humid = case_when(is.infinite(from_humid) ~ NA_real_,
TRUE ~ from_humid)) %>%
mutate(to_humid = case_when(is.infinite(to_humid) ~ NA_real_,
TRUE ~ to_humid))
ggplot()+
geom_sf(data=statBel_muni , fill="#a1b7b8", color="#333")+
geom_sf(data=weatherstations_sf, color="#df543b", size=1)+
geom_sf_text(data = statBel_muni ,
aes(label = cd_dstr_refnis),
size = 3, color = "black") +
labs(
title = "Weather Station Location in Belgium",
subtitle = "Weather data from RIEM package",
caption = "Source: StatBel, RIEM package by Maelle Salmon"
) +
theme_void()+plotTheme
grid.arrange(top = "Weather DATA - Belgium - Jul - Dec 2016",
ggplot(weather_belgium, aes(interval60, Humid)) + geom_line(aes(color=station_id))+plotTheme,
ggplot(weather_belgium, aes(interval60, Wind)) + geom_line(aes(color=station_id))+plotTheme,
ggplot(weather_belgium, aes(interval60, Temp)) + geom_line(aes(color=station_id))+plotTheme)
Train types in Belgium is as the following.[2] In this analysis, we removed “EUR”, “ICE”, “ICT”, “TGV”, “TRN”, and other undefined vehicle code.
InterCity (IC): IC trains connect Belgium’s large cities. These trains only stop at the biggest train stations and sometimes cross international borders.
Peak (P): P trains run during peak travel times. They provide additional alternatives when you’re travelling during busy periods. Most of these trains run in the mornings and in late afternoon.
Local trains (L): L trains generally connect cities, but they also stop at every station along the route.
S-trains (S): S trains are suburban trains that connect the city centre with the surrounding communes. S trains stop in most stations along the route.
EXTRA: Additional train services, used at exceptionally busy periods. For example, these are the trains that travel towards the Belgian coast on very sunny days.
T (Touristic): Additional train service to certain tourist destinations.
EXP (Coast Express): Extra train service to the Belgian coast during the summer period. International trains (INT = EC, THA, TGV, ICE, EST): Regular international trains, namely Eurocity, Thalys, TGV, ICE and Eurostar.
# Left join line_data into trains data
trains <- trains %>%
left_join(line_info, by = c("vehicle" = "vehicle_id"))
# Clean up vehicle type
trains <- trains %>%
mutate(
vehicle_type = ifelse(
is.na(vehicle_type) | vehicle_type == "",
gsub("[0-9]", "", vehicle),
vehicle_type
)
)
trains <- trains %>%
mutate(vehicle_type = case_when(
vehicle_type == "ic" & row_number() == min(which(vehicle_type == "ic")) ~ "IC",
TRUE ~ vehicle_type
))
trains <- trains %>%
mutate(vehicle = case_when(
vehicle == "ic2029" & row_number() == min(which(vehicle == "ic2029")) ~ "IC2029",
TRUE ~ vehicle
))
# Filter few vehicle_type
trains <- trains %>%
filter(!(vehicle_type %in% c("EUR", "ICE", "ICT", "TGV", "TRN", "undefined", "(null)")))
# Create event data for weekends and holidays
trains <- trains %>%
mutate(event = case_when(dotw == "Sat" ~ 1,
dotw == "Sun" ~ 1,
ymd(interval60) >= "2016-07-28" & ymd(interval60) <= "2016-09-04" ~ 1,
ymd(interval60) == "2016-11-03" | ymd(interval60) == "2016-11-04" | ymd(interval60) == "2016-11-05" ~ 1,
ymd(interval60) == "2016-11-13" | ymd(interval60) == "2016-11-14" | ymd(interval60) == "2016-11-15" ~ 1,
TRUE ~ 0))
## Warning: There were 8 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `event = case_when(...)`.
## Caused by warning:
## ! 3732 failed to parse.
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 7 remaining warnings.
# Calculate the distance from O to Brussels, from D to Brussels, and from O to D in kilometers
trains <- trains %>%
mutate(
dist_from_O_to_Brussels = pmap_dbl(
list(from_lng, from_lat),
~ distHaversine(c(4.3517, 50.8503), c(..1, ..2)) / 1000
),
dist_from_D_to_Brussels = pmap_dbl(
list(to_lng, to_lat),
~ distHaversine(c(4.3517, 50.8503), c(..1, ..2)) / 1000
),
dist_from_O_to_D = pmap_dbl(
list(from_lng, from_lat, to_lng, to_lat),
~ distHaversine(c(..1, ..2), c(..3, ..4)) / 1000
)
)
# Calculate the bearing angle from O to D
deg2rad <- function(deg) {
return(deg * pi / 180)
}
rad2deg <- function(rad) {
return(rad * 180 / pi)
}
calculate_bearing <- function(lat1, lon1, lat2, lon2) {
phi1 <- deg2rad(lat1)
phi2 <- deg2rad(lat2)
delta_lambda <- deg2rad(lon2 - lon1)
theta <- atan2(
sin(delta_lambda) * cos(phi2),
cos(phi1) * sin(phi2) - sin(phi1) * cos(phi2) * cos(delta_lambda)
)
bearing <- (rad2deg(theta) + 360) %% 360
return(bearing)
}
trains$bearing_from_O_to_D <- calculate_bearing(trains$from_lat, trains$from_lng, trains$to_lat, trains$to_lng)
trains$bearing_from_O_to_D_cat <- case_when(
trains$bearing_from_O_to_D >= 337.5 | trains$bearing_from_O_to_D < 22.5 ~ "N",
trains$bearing_from_O_to_D >= 22.5 & trains$bearing_from_O_to_D < 67.5 ~ "NE",
trains$bearing_from_O_to_D >= 67.5 & trains$bearing_from_O_to_D < 112.5 ~ "E",
trains$bearing_from_O_to_D >= 112.5 & trains$bearing_from_O_to_D < 157.5 ~ "SE",
trains$bearing_from_O_to_D >= 157.5 & trains$bearing_from_O_to_D < 202.5 ~ "S",
trains$bearing_from_O_to_D >= 202.5 & trains$bearing_from_O_to_D < 247.5 ~ "SW",
trains$bearing_from_O_to_D >= 247.5 & trains$bearing_from_O_to_D < 292.5 ~ "W",
trains$bearing_from_O_to_D >= 292.5 & trains$bearing_from_O_to_D < 337.5 ~ "NW"
)
The following plot shows a distribution between high and low occupancy levels by hour of the day. The train schedule was concentrated at 06:00-08:59 and 16:00-19:59. This result corresponds to the typical rush hour in Belgium, which is 08:00-17:00. Despite the large number of trains in this time zone, the low occupancy rate was low at around 33%. However, the low occupancy rate was about 62% at 10:00-14:59.
# Categorize time
time_category <- function(time) {
hour <- as.numeric(format(time, "%H"))
if ((hour >= 6 & hour < 9) | (hour >= 16 & hour < 20)) {
return("commute time")
} else if (hour >= 0 & hour < 6) {
return("morning")
} else if (hour >= 9 & hour < 16) {
return("noon")
} else {
return("evening")
}
}
trains <- trains %>%
mutate(time_cat = sapply(datetime, time_category))
trains$time_cat <- factor(trains$time_cat, levels = c( "morning", "commute time", "noon", "evening"))
The map below shows the location of 567 train stations in Belgium. The colored stations are primary stations that represents each zone. SNCB operates trains in Belgium as well as France and the Netherlands, but we filtered the dataset to exclude non-Belgian stations (e.g. those in the Netherlands, France, Luxembourg etc.) in this analysis.
# Create a leaflet map
pal <- colorFactor(
palette = "Set1",
domain = stations_sf$Zone
)
leaflet(stations_sf) %>%
addTiles() %>%
addCircleMarkers(
label = ~name,
radius = 2,
color = ~pal(Zone),
fillOpacity = 0.9
) %>%
addLegend("bottomright", pal=pal, values=~Zone, labels = "Stations", title = "Station Map")
Trains data from July to October 2016 shows that more than 40% of trains had low occupancy as shown below.
# Occupancy distribution
trains %>%
subset(!is.na(occupancy)) %>%
count(occupancy) %>%
ggplot(aes(x = occupancy, y = n, fill = occupancy)) +
geom_bar(stat = "identity") +
geom_text(
aes(label = paste0(n, " (", round(n / sum(n) * 100, 1), "%)")),
vjust = -0.5
) +
scale_fill_manual(values = colorPallete) +
labs(
title = "Distribution of Train Occupancy Levels",
subtitle = "Training data, July - Oct 2016",
x = "Occupancy Level",
y = "Count of Trains"
) +
theme_minimal()+plotTheme+
theme(legend.position = "none")
The following plot shows a distribution between high and low occupancy levels by hour of the day. The train schedule was concentrated at 06:00-08:59 and 16:00-19:59. This result corresponds to the typical rush hour in Belgium, which is 08:00-17:00. Despite the large number of trains in this time zone, the low occupancy rate was low at around 33%. However, the low occupancy rate was about 62% at 10:00-14:59.
# Time trend
trains %>%
subset(!is.na(occupancy)) %>%
group_by(hour = hour(datetime), occupancy) %>%
count() %>%
ggplot(aes(x = factor(hour), y = n, fill = occupancy)) +
geom_bar(stat = "identity") +
geom_text(
aes(label = paste0(n)),
position = position_stack(vjust = 0.5)
) +
scale_fill_manual(values = colorPallete) +
labs(
title = "Train Occupancy Trend by Hour of the Day",
subtitle = "Training data, July - Oct 2016",
x = "Hour of Day",
y = "Count of Trains",
fill = "Occupancy Level"
) +
theme_minimal()+plotTheme+
theme(legend.position = "bottom")
sum_trains <- trains %>%
subset(!is.na(occupancy)) %>%
group_by(hour = hour(datetime), occupancy) %>%
count() %>%
pivot_wider(names_from = occupancy, values_from = n, values_fill = 0) %>%
mutate(
ratio = low / (low + high),
total = low + high
)
trains %>%
subset(!is.na(occupancy)) %>%
group_by(hour = hour(datetime)) %>%
count(occupancy) %>%
ggplot(aes(x = hour, y = n, color = occupancy)) +
geom_line() +
scale_color_manual(values = colorPallete) +
labs(
title = "Train Occupancy Trend by Hour of the Day",
subtitle = "Training data, July - Oct 2016",
x = "Hour of Day",
y = "Count"
) +
theme_minimal()+plotTheme+
theme(legend.position = "bottom")
The proportion of low occupancy routes on weekdays is lower than on weekends. According to the plot below, the average rate of low occupancy routes is 40.7% on weekdays (Monday to Friday), while 43.8% on weekends (Saturday and Sunday).
trains %>%
subset(!is.na(occupancy)) %>%
mutate(date = as.Date(datetime)) %>%
group_by(dotw, occupancy) %>%
count() %>%
ggplot(aes(x = dotw, y = n, fill=occupancy)) + # Use color for lines
geom_bar(stat = "identity") +
geom_text(
aes(label = paste0(n)),
position = position_stack(vjust = 0.5)
) +
scale_fill_manual(values = colorPallete) +
labs(
title = "Train Occupancy Trend by Day of Week",
subtitle = "Training data, July - Oct 2016",
x = "Day",
y = "Count of Trains",
color = "Occupancy Level"
) +
theme_minimal()+plotTheme+
theme(legend.position = "bottom")
sum_trains2 <- trains %>%
subset(!is.na(occupancy)) %>%
group_by(dotw, occupancy) %>%
count() %>%
pivot_wider(names_from = occupancy, values_from = n, values_fill = 0) %>%
mutate(
ratio = low / (low + high),
total = low + high
)
trains %>%
subset(!is.na(occupancy)) %>%
mutate(date = as.Date(datetime)) %>%
group_by(dotw, occupancy) %>%
summarise(n = n(), .groups = "drop") %>% # Use summarise instead of count
ggplot(aes(x = dotw, y = n, color = occupancy, group = occupancy)) +
geom_line(size = 1) +
scale_color_manual(values = colorPallete) +
labs(
title = "Train Occupancy Trend by Day of Week",
subtitle = "Training data, July - Oct 2016",
x = "Day",
y = "Count of Trains",
color = "Occupancy Level"
) +
theme_minimal()+plotTheme+
theme(legend.position = "bottom")
When comparing the trains data from mid-August to
mid-September with the data from mid-September to the end of October, we
observed that the number of record is significantly different. One
possible reason we guess is missing data. Another reason could be that
SNCB holiday period from the end of July to end of August, during which
certain IC, S and P trains do not run, so that there is possibly a
reduced number of trains during this period. In this analysis, we
assumed that there were no missing data, and set event as 1 for the date
of holiday period as well as weekends and public holidays.
trains %>%
subset(!is.na(occupancy)) %>%
mutate(date = as.Date(datetime)) %>%
group_by(date, hour = hour(datetime), occupancy) %>%
count() %>%
ggplot(aes(x = date, y = n, fill=occupancy)) + # Use color for lines
geom_bar(stat = "identity") + # Stacked bar chart
scale_fill_manual(values = colorPallete) +
labs(
title = "Train Occupancy Trend by Date",
subtitle = "Training data, July - Oct 2016",
x = "Date and Hour",
y = "Count of Trains",
color = "Occupancy Level"
) +
theme_minimal()+plotTheme+
theme(legend.position = "bottom")
sum_trains3 <- trains %>%
subset(!is.na(occupancy)) %>%
group_by(date, dotw, occupancy) %>%
count() %>%
pivot_wider(names_from = occupancy, values_from = n, values_fill = 0) %>%
mutate(
ratio = low / (low + high),
total = low + high
) %>%
filter(total >=5)
trains %>%
subset(!is.na(occupancy)) %>%
group_by(date = date(datetime)) %>%
count(occupancy) %>%
ggplot(aes(x = date, y = n, color = occupancy)) +
geom_line() +
scale_color_manual(values = colorPallete) +
labs(
title = "Train Occupancy Trend by Date",
subtitle = "Training data, July - Oct 2016",
x = "Date",
y = "Count"
) +
theme_minimal()+plotTheme+
theme(legend.position = "bottom")
In sum, most prominent low occupancy routes are during off-peak hours, on weekends, and scattered throughout the date on early July-August. Meanwhile, high occupancy routes are strong during commuter hour, on weekdays, and on mid-late October.
# Filter high occupancy
high_occupancy <- trains %>% subset(!is.na(occupancy)) %>%
filter(occupancy == "high") %>%
count(from_station, to_station) %>%
arrange(desc(n)) %>%
rename(n_high = n)
low_occupancy <- trains %>% subset(!is.na(occupancy)) %>%
filter(occupancy == "low") %>%
count(from_station, to_station) %>%
arrange(desc(n))%>%
rename(n_low = n)
# Create and left join the frequency (count) of high and low occupancy data of origin and destination station by each OD pair
trains <- trains %>%
left_join(high_occupancy, by = c("from_station" = "from_station", "to_station" = "to_station"))
trains <- trains %>%
left_join(low_occupancy, by = c("from_station" = "from_station", "to_station" = "to_station"))
occupancy <- high_occupancy %>%
left_join(low_occupancy, by = c("from_station" = "from_station", "to_station" = "to_station"))
occupancy_data <- trains %>%
subset(!is.na(occupancy)) %>%
count(from_station, to_station, occupancy) %>%
group_by(from_station, to_station) %>%
mutate(total = sum(n))
occupancy_data %>%
filter(!is.na(to_station)) %>%
ungroup() %>%
arrange(desc(total)) %>%
slice_head(n = 70) %>% # Top 70 OD pairs
ggplot(aes(
x = reorder(paste(from_station, to_station, sep = " → "), total),
y = n,
fill = occupancy
)) +
geom_bar(stat = "identity") + # Stacked bar chart
coord_flip() +
scale_fill_manual(values = colorPallete) +
labs(
title = "Top 25 Count of Trains OD Pairs",
subtitle = "Training data, July - Oct 2016",
x = "OD Pair",
y = "Count of Trains",
fill = "Occupancy Level"
) +
theme_minimal()+plotTheme+
theme(legend.position = "bottom")
# Plot top 10 OD pairs with high occupancy
high_occupancy %>%
filter(!is.na(to_station)) %>%
head(10) %>%
ggplot(aes(x = reorder(paste(from_station, to_station, sep = " → "), n_high), y = n_high)) +
geom_bar(stat = "identity", fill = "#a1b7b8") +
coord_flip() +
labs(
title = "Top 10 High-Occupancy OD Pairs",
subtitle = "Training data, July - Oct 2016",
x = "OD Pair",
y = "Count"
) +
theme_minimal()+plotTheme
# Plot top 10 OD pairs with low occupancy
low_occupancy %>%
filter(!is.na(to_station)) %>%
head(10) %>%
ggplot(aes(x = reorder(paste(from_station, to_station, sep = " → "), n_low), y = n_low)) +
geom_bar(stat = "identity", fill = "#df543b") +
coord_flip() +
labs(
title = "Top 10 Low-Occupancy OD Pairs",
subtitle = "Training data, July - Oct 2016",
x = "OD Pair",
y = "Count"
) +
theme_minimal()+plotTheme
# Plot top 25 OD pairs with low occupancy percentage
high_occupancy %>%
left_join(low_occupancy, by = c("from_station" = "from_station", "to_station" = "to_station")) %>%
filter(!is.na(to_station)) %>%
mutate(perc = n_low/(n_high+n_low)*100) %>%
arrange(desc(perc))%>%
head(25) %>%
ggplot(aes(x = reorder(paste(from_station, to_station, sep = " → "), perc), y = perc)) +
geom_bar(stat = "identity", fill = "#df543b") +
coord_flip() +
labs(
title = "Top 25 Low-Occupancy Percentage OD Pairs",
subtitle = "Training data, July - Oct 2016",
x = "OD Pair",
y = "Percentage of Low Occupancy (%)"
) +
theme_minimal()+plotTheme
# Distribution summary
#summary(high_occupancy$n)
#summary(low_occupancy$n)
# group_by(from_station, to_station) %>%
# summarise(n = sum(n)) %>%
# ungroup()
# Visualize distribution
ggplot(low_occupancy, aes(x = n_low)) +
geom_histogram(binwidth = 1, fill = "blue", color = "white") +
labs(
title = "Distribution of Low Occupancy Counts",
x = "Count (n)",
y = "Frequency"
) +
theme_minimal()+plotTheme
trains %>%
filter(!is.na(occupancy)) %>%
select(occupancy, from_avg_stop_times, to_avg_stop_times, nr_of_stops) %>%
gather(Variable, value, -occupancy) %>%
ggplot(aes(occupancy, value, fill=occupancy)) +
geom_bar(position="dodge", stat="summary", fun="mean") +
facet_wrap(~Variable, scales="free") +
scale_fill_manual(values=colorPallete) +
labs(x="Occupancy", y="Mean",
title="Feature associations with the likelihood of occupancy Level",
subtitle="Average stop times of origin and destination station and the number of stops of each line\nTraining data, July - Oct 2016",
caption="Source: Kaggle")+
theme_minimal()+plotTheme+
theme(legend.position="none")
## Warning: Removed 159 rows containing non-finite outside the scale range
## (`stat_summary()`).
trains %>%
filter(!is.na(occupancy)) %>%
select(occupancy, from_wind, to_wind, from_temp, to_temp, from_humid, to_humid) %>%
gather(Variable, value, -occupancy) %>%
ggplot(aes(occupancy, value, fill=occupancy)) +
geom_bar(position="dodge", stat="summary", fun="mean") +
facet_wrap(~Variable, scales="free") +
scale_fill_manual(values=colorPallete) +
labs(x="Occupancy", y="Mean",
title="Feature associations with the likelihood of occupancy Level",
subtitle="Temperature and wind of origin and destination station\nTraining data, July - Oct 2016",
caption="Source: Kaggle")+
theme_minimal()+plotTheme+
theme(legend.position="none")
## Warning: Removed 568 rows containing non-finite outside the scale range
## (`stat_summary()`).
trains %>%
filter(!is.na(occupancy)) %>%
select(occupancy, dist_from_O_to_D, dist_from_O_to_Brussels, dist_from_D_to_Brussels) %>%
gather(Variable, value, -occupancy) %>%
ggplot(aes(occupancy, value, fill=occupancy)) +
geom_bar(position="dodge", stat="summary", fun="mean") +
facet_wrap(~Variable, scales="free") +
scale_fill_manual(values=colorPallete) +
labs(x="Occupancy", y="Mean distance (km)",
title="Feature associations with the likelihood of occupancy Level",
subtitle="Distance from O, D to Brussels, and from O to D\nTraining data, July - Oct 2016",
caption="Source: Kaggle")+
theme_minimal()+plotTheme+
theme(legend.position="none")
## Warning: Removed 85 rows containing non-finite outside the scale range
## (`stat_summary()`).
trains %>%
filter(!is.na(occupancy)) %>%
select(occupancy, bearing_from_O_to_D_cat) %>%
gather(Variable, value, -occupancy) %>%
count(Variable, value, occupancy) %>%
ggplot(aes(occupancy, n, fill=occupancy)) +
geom_bar(position="dodge", stat="summary") +
ylim(0,250)+
facet_wrap(~value, scales="free") +
scale_fill_manual(values=colorPallete) +
labs(x="Occupancy", y="Count",
title="Feature associations with the likelihood of low occupancy",
subtitle="Vehicle type for IC, L, P, S, and THA\nTraining data, July - Oct 2016",
caption="Source: Kaggle")+
theme_minimal()+plotTheme+
theme(legend.position="none")
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
trains %>%
filter(!is.na(occupancy)) %>%
select(occupancy, event) %>%
gather(Variable, value, -occupancy) %>%
count(Variable, value, occupancy) %>%
ggplot(aes(occupancy, n, fill=occupancy)) +
geom_bar(position="dodge", stat="summary") +
facet_wrap(~value, scales="free") +
scale_fill_manual(values=colorPallete) +
ylim(0,1200)+
labs(x="Occupancy", y="Count",
title="Feature associations with the likelihood of low occupancy",
subtitle="Event: 0 = Weekday, 1 = Weekend or holiday\nTraining data, July - Oct 2016",
caption="Source: Kaggle")+
theme_minimal()+plotTheme+
theme(legend.position="none")
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
### (6) Categorical Variables: Time category
trains %>%
filter(!is.na(occupancy)) %>%
select(occupancy, time_cat) %>%
gather(Variable, value, -occupancy) %>%
count(Variable, value, occupancy) %>%
ggplot(aes(occupancy, n, fill=occupancy)) +
geom_bar(position="dodge", stat="summary") +
ylim(0,650)+
facet_wrap(~value, scales="free") +
scale_fill_manual(values=colorPallete) +
labs(x="Occupancy", y="Count",
title="Feature associations with the likelihood of low occupancy",
subtitle="0700-0900 and 1700-1900 commute, 1900-2400 evening, 0000-0500 midnight, 0500-0700 morning, 0900-1700 noon\nTraining data, July - Oct 2016",
caption="Source: Kaggle")+
theme_minimal()+plotTheme+
theme(legend.position="none")
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_summary()`).
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
train_summary <- trains %>%
filter(!is.na(occupancy)) %>%
group_by(dotw, time_cat) %>%
summarize(
total_count = n(),
low_count = sum(occupancy == "low", na.rm = TRUE),
low_ratio = low_count / total_count*100
) %>%
ungroup()
## `summarise()` has grouped output by 'dotw'. You can override using the
## `.groups` argument.
ggplot(train_summary, aes(x = dotw, y = low_ratio, fill=time_cat)) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_manual(
values = c(
"morning" = "#1f78b4",
"commute time" = "#33a02c",
"noon" = "#e31a1c",
"evening" = "#ff7f00",
"midnight" = "#6a3d9a"
)
) +
labs(
title="Feature associations with the likelihood of occupancy Level",
subtitle="Low occupancy rate by time category of each day\nTraining data, July - Oct 2016",
x = "Day of Week",
y = "Low occupancy rate (%)",
caption="Source: Kaggle"
) +
theme_minimal() +
theme(legend.position = "bottom")+plotTheme
trains %>%
filter(!is.na(occupancy)) %>%
select(occupancy, vehicle_type) %>%
gather(Variable, value, -occupancy) %>%
count(Variable, value, occupancy) %>%
ggplot(aes(occupancy, n, fill=occupancy)) +
geom_bar(position="dodge", stat="summary") +
ylim(0,1000)+
facet_wrap(~value, scales="free") +
scale_fill_manual(values=colorPallete) +
labs(x="Occupancy", y="Count",
title="Feature associations with the likelihood of low occupancy",
subtitle="Vehicle type for IC, L, P, S, and THA\nTraining data, July - Oct 2016",
caption="Source: Kaggle")+
theme_minimal()+plotTheme+
theme(legend.position="none")
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`
threshold <- 3 # Moderate threshold capturing higher-demand OD pairs
# Filter high-occupancy OD pairs
high_routes <- high_occupancy %>%
filter(n_high >= threshold) %>%
left_join(stations, by = c("from_station" = "name")) %>%
rename(from_lon = longitude, from_lat = latitude) %>%
left_join(stations, by = c("to_station" = "name")) %>%
rename(to_lon = longitude, to_lat = latitude)
low_routes <- low_occupancy %>%
filter(n_low >= threshold) %>%
left_join(stations, by = c("from_station" = "name")) %>%
rename(from_lon = longitude, from_lat = latitude) %>%
left_join(stations, by = c("to_station" = "name")) %>%
rename(to_lon = longitude, to_lat = latitude)
# Create line data for leaflet polylines
high_routes <- high_routes %>%
mutate(route_label = paste(from_station, "->", to_station))
low_routes <- low_routes %>%
mutate(route_label = paste(from_station, "->", to_station))
# Step 4: Create line data for leaflet polylines
routes_lines <- high_routes %>%
rowwise() %>%
do({
data.frame(
lng = c(.$from_lon, .$to_lon),
lat = c(.$from_lat, .$to_lat),
route_label = .$route_label
)
})
routes_lines_l <- low_routes %>%
rowwise() %>%
do({
data.frame(
lng = c(.$from_lon, .$to_lon),
lat = c(.$from_lat, .$to_lat),
route_label = .$route_label
)
})
# Create a color palette for the zones
zone_colors <- colorFactor(
palette = "Dark2",
domain = stations_sf$Zone # Map unique zone values
)
# Create Leaflet map
leaflet() %>%
addTiles() %>%
# Add station markers
addCircleMarkers(
data = stations_sf,
label = ~name,
radius = 3,
color = ~zone_colors(Zone),
fillOpacity = 0.1
) %>%
# Add OD pair lines
addPolylines(
data = routes_lines,
lng = ~lng,
lat = ~lat,
color = "#a1b7b8",
opacity = 0.3,
weight = 2,
label = ~route_label
) %>%
addPolylines(
data = routes_lines_l,
lng = ~lng,
lat = ~lat,
color = "#df543b",
opacity = 0.1,
weight = 2,
label = ~route_label
) %>%
# Add legend
addLegend(
"bottomright",
colors = c("black", "#a1b7b8", "#df543b"),
labels = c("Stations", "High Occupancy Routes", "Low Occupancy Routes"),
title = "Routes Map"
)
trains <- trains %>%
mutate(date_numeric = as.numeric(as.Date(date))) %>%
mutate(time_numeric = as.numeric(as_hms(time))) %>%
mutate(occupancy_numeric = case_when(
occupancy == "low" ~ 1,
occupancy == "high" ~ 0,
TRUE ~ NA_real_
))
# Split trains into trains_train and trains_test
trains_train <- trains %>% filter(!is.na(occupancy))
trains_test <- trains %>% filter(is.na(occupancy))
# Split trains_train into trains_trainTrain and trains_trainTest
set.seed(3456)
trainIndex <- createDataPartition(trains_train$occupancy, p = .65,
list = FALSE,
times = 1)
trains_trainTrain <- trains_train[ trainIndex,]
trains_trainTest <- trains_train[-trainIndex,]
reg <- glm(occupancy_numeric ~ ., data =
trains_trainTrain %>% dplyr::select(date_numeric, time_numeric, dotw, occupancy_numeric),
family = "binomial"(link = "logit"))
a<-summary(reg) # / AIC: 1,987
pseudo_r2 <- pR2(reg)
## fitting null model for pseudo-r2
pseudo_r2_df <- as.data.frame(t(pseudo_r2))
kable(pseudo_r2_df, "html", caption = "Pseudo-R² Values for `train logistic model`") %>%
kable_styling(bootstrap_options = c("hover", "striped"))
| llh | llhNull | G2 | McFadden | r2ML | r2CU |
|---|---|---|---|---|---|
| -984.2853 | -1008.389 | 48.20693 | 0.0239029 | 0.0318569 | 0.042939 |
testProbs <- data.frame(Outcome = as.numeric(trains_trainTest$occupancy_numeric),
Probs = predict(reg, trains_trainTest, type= "response"))
#testProbs$Probs_p <- case_when(
# testProbs$Probs >= 0.5 ~ 1,
# testProbs$Probs < 0.5 ~ 0,
# TRUE ~ NA
#)
ggplot(testProbs, aes(x = Probs, fill = as.factor(Outcome))) +
geom_density() +
facet_grid(Outcome ~ .) +
xlim(0,1)+
scale_fill_manual(values = c("#a1b7b8", "#df543b")) +
labs(x = "Prediction of Occupancy", y = "Density of probabilities",
title = "Distribution of predicted probabilities by observed outcome") +
theme(strip.text.x = element_text(size = 18),
legend.position = "none")+plotTheme
ggplot() +
geom_roc(data= testProbs, aes(d = as.numeric(Outcome), m = Probs), n.cuts = 50, labels = FALSE) +
style_roc(theme = theme_grey) +
geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
labs(title = "ROC Curve ",
subtitle = "trains_trainTest that is split of trains_train") +
theme(legend.position = "bottom")+plotTheme
pROC::auc(testProbs$Outcome, testProbs$Probs) # AUC = 0.6277
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.6277
reg <- glm(occupancy_numeric ~ ., data =
trains_trainTrain %>% dplyr::select(date_numeric, time_numeric, dotw, occupancy_numeric,
from_avg_stop_times, from_pop, from_avg_stop_times, to_avg_stop_times, to_pop, from_temp, from_wind, from_humid, to_temp, to_wind, to_humid, event, nr_of_stops, dist_from_O_to_D, dist_from_O_to_Brussels, dist_from_D_to_Brussels, time_cat, vehicle_type, bearing_from_O_to_D_cat),
family = "binomial"(link = "logit"))
b<-summary(reg) # / AIC: 1,651
pseudo_r2 <- pR2(reg)
## fitting null model for pseudo-r2
pseudo_r2_df <- as.data.frame(t(pseudo_r2))
kable(pseudo_r2_df, "html", caption = "Pseudo-R² Values for `train logistic model`") %>%
kable_styling(bootstrap_options = c("hover", "striped"))
| llh | llhNull | G2 | McFadden | r2ML | r2CU |
|---|---|---|---|---|---|
| -788.71 | -883.1156 | 188.8112 | 0.1069006 | 0.1347994 | 0.1816908 |
testProbs <- data.frame(Outcome = (trains_trainTest$occupancy_numeric),
Probs = predict(reg, trains_trainTest, type= "response"))
ggplot(testProbs, aes(x = Probs, fill = as.factor(Outcome))) +
geom_density() +
facet_grid(Outcome ~ .) +
xlim(0,1)+
scale_fill_manual(values = c("#a1b7b8", "#df543b")) +
labs(x = "Prediction of Occupancy", y = "Density of probabilities",
title = "Distribution of predicted probabilities by observed outcome") +
theme(strip.text.x = element_text(size = 18),
legend.position = "none")+plotTheme
testProbs <- testProbs %>%
mutate(predOutcome = (ifelse(Probs >= 0.45,1,0))) # Threshold = 0.45
testProbs %>%
filter( !is.na(predOutcome)) %>%
summarize(observedLow=sum(Outcome), predictedLow=sum(predOutcome)) %>%
gather(Variable, Value) %>%
ggplot(aes(x = Variable, y = Value))+
geom_bar(position="dodge", stat="identity")
ggplot() +
geom_roc(data= testProbs, aes(d = as.numeric(Outcome), m = Probs), n.cuts = 50, labels = FALSE) +
style_roc(theme = theme_grey) +
geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
labs(title = "ROC Curve ",
subtitle = "trains_trainTest that is split of trains_train") +
theme(legend.position = "bottom")+plotTheme
pROC::auc(testProbs$Outcome, testProbs$Probs) # AUC = 0.6764
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.6727
[1] https://github.com/iRail/iRail/blob/master/public/terms/index.html
[1] https://www.rdocumentation.org/packages/riem/versions/0.3.2/topics/riem_measures
[2] https://www.belgiantrain.be/en/support/faq/faq-routes-schedules/faq-train-types
[3] https://www.belgiantrain.be/en/travel-info/train-network-travel-info/calendar-adaptations